Abstract


State of the art benchmarks for Twitter Sentiment
Analysis do not consider the fact that for more than
half of the tweets from the public stream a distinct
sentiment cannot be chosen. This paper provides a new
perspective on Twitter Sentiment Analysis by
highlighting the necessity of explicitly incorporating
uncertainty. Moreover, a dataset of high quality to
evaluate solutions for this new problem is introduced
and made publicly available.¹

1 Introduction 


As a field of research, Twitter Sentiment Analysis
has gained much attention recently. For a multitude
of applications such as sales prediction (Asur and
Huberman, 2010), stock market prediction (Bollen
et al., 2011) or political debate analysis
(Diakopoulos and Shamma, 2010) it has been shown to
generate practical value. Twitter Sentiment Analysis
denotes the task of assigning a given tweet a
sentiment label of either positive or negative and is
an integral part of many practical applications. Few
methods consider neutral as a third class. However,
defining a neutral class is a hard task. Pak and
Paroubek (2010), for example, label tweets of popular
news sites as neutral. This assumption is not
always true. The headline “Multiple children were
killed in the attack.” would be labeled as negative
by most human labelers. Thus, we propose an
alternative approach to this problem. Its basic idea is
the explicit incorporation of sentiment uncertainty.


2 The State of the Art and Its Shortcomings

SemEval-2014 Task 9 (Rosenthal et al., 2014) provides
a widely used state of the art benchmark for
Twitter Sentiment Analysis and compares the
performance of many current approaches. From a
dataset collected from January 2012 to January
2013, popular topics have been extracted through
identification of frequently mentioned named
entities. Only tweets scoring above a certain polarity
threshold determined by a sentiment lexicon were
considered to ensure the inclusion of a sentiment.
The labels included in the dataset are positive,
negative and neutral, determined by a majority vote of
five labelers who were told to vote for the sentiment
they perceive as strongest, when in doubt. As a
result, tweets that do not carry a distinct sentiment
are assigned to the classes positive and negative.
Methods performing well on this dataset are shown to
be able to distinguish between positive and negative
sentiment under the assumption that all tweets can
be assigned one of these labels. Moreover, all test
tweets include popular named entities of the time.
As the authors themselves noted, the dataset is
biased. Moreover, the majority vote along with the
treatment of ambiguity adds noise to the dataset.
While the dataset is of high quality for its intended
purpose, it does not represent the general composition
of the public Twitter stream. Hence, only part of the
problems arising in practical analysis of the live
stream are addressed by the related research.

¹ http://project2.cs.uos.de/TweeDOS


3 A General Purpose Dataset 


When analysing the Twitter stream we are interested
in the “Electronic Word of Mouth” (Jansen
et al., 2009), i.e. the personal opinions of private
Twitter users. While labeling tweets, we noticed
that a relatively high percentage of tweets are spam,
advertising or marketing messages which we are
not interested in. Those tweets shall be labeled
spam. Moreover, it became obvious that of the
remaining tweets only a small fraction can be
distinctly labeled as positive or negative. The
remaining tweets may still include polarity and can
often not be labeled neutral while being neither
positive nor negative. Hence, we propose the new
category uncertain. Tweets labeled as neutral can be
assigned to the class uncertain too, as they provide
no additional information for sentiment analysis
and can be treated in the same way as tweets of
uncertain sentiment. This approach reduces the noise
for the sentiment bearing classes, which is a
desirable feature if political or business decisions are
supposed to be supported by the analysis results.

To acquire a representative view on the label
composition of the public Twitter stream, we
randomly sampled our dataset from a collection of
about 43 million tweets with their creation dates
ranging from June 2012 to August 2013 to minimize
topical bias. Each tweet was labeled by two
human labelers who had to assign it one of the
labels positive, negative, uncertain or spam. In total
14506 tweets have been labeled by 27 labelers. The
labelers consisted of master’s students from the
University of Osnabrück, Germany and researchers
from our group.

The distribution of labels is shown in Figure 1.
Both human labelers assigned the same label to 9356
tweets (64.5% of all labeled tweets). Of these
tweets 15% are spam and 55% are labeled uncertain.
A definite sentiment label could only be assigned to
30% of tweets, with 13% being positive and 17% being
negative.

Figure 1: Distribution of labels for tweets which
both labelers agreed upon.

These results provide evidence for our claim that
one has to deal with uncertainty in sentiment
analysis when working with the public Twitter stream.
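
The reported percentages can be recomputed from the
agreement counts. The following is a minimal sketch
in Python; the counts are the diagonal entries of
Table 1, and the names `agreed` and `label_shares`
are illustrative and not part of the published
dataset.

```python
# Counts of tweets on which both labelers agreed (diagonal of Table 1).
agreed = {"positive": 1176, "negative": 1620, "uncertain": 5138, "spam": 1422}
TOTAL_LABELED = 14506  # all tweets that received two labels


def label_shares(counts):
    """Return each label's share of the given counts, in percent."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


n_agreed = sum(agreed.values())
agreement_rate = 100.0 * n_agreed / TOTAL_LABELED
shares = label_shares(agreed)

print(f"agreed tweets: {n_agreed} ({agreement_rate:.1f}% of {TOTAL_LABELED})")
for label, pct in shares.items():
    print(f"{label:>9}: {pct:.0f}%")
```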

To assess the inter-annotator agreement we computed
Fleiss’ Kappa (Fleiss, 1971), resulting in a
value of κ = 0.45, which can be interpreted as
moderate agreement (Landis and Koch, 1977). At first
sight this value seems to be rather low, but when
considering the disagreement matrix shown in Table
1 the claim of the necessity to deal with uncertainty
is further strengthened.



            positive   negative   uncert.   spam
positive        1176        106      1666    143
negative                   1620      2263     58
uncert.                              5138    914
spam                                        1422


Table 1: Disagreement matrix showing the absolute 
number of label combinations. 
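
As a rough check, the agreement statistic can be
recomputed from the counts in Table 1. The sketch
below assumes that each cell counts unordered label
pairs and that every tweet received exactly two
labels; the function name `fleiss_kappa_two_raters`
is illustrative. With two ratings per item, the
per-item agreement in Fleiss' formula is 1 if the
labels match and 0 otherwise.

```python
LABELS = ["positive", "negative", "uncertain", "spam"]

# Unordered label pairs and their counts, copied from Table 1.
PAIR_COUNTS = {
    ("positive", "positive"): 1176,
    ("positive", "negative"): 106,
    ("positive", "uncertain"): 1666,
    ("positive", "spam"): 143,
    ("negative", "negative"): 1620,
    ("negative", "uncertain"): 2263,
    ("negative", "spam"): 58,
    ("uncertain", "uncertain"): 5138,
    ("uncertain", "spam"): 914,
    ("spam", "spam"): 1422,
}


def fleiss_kappa_two_raters(pair_counts, labels):
    """Fleiss' kappa for items that each received exactly two labels."""
    n_items = sum(pair_counts.values())

    # Observed agreement: fraction of items whose two labels match.
    p_observed = sum(n for (a, b), n in pair_counts.items() if a == b) / n_items

    # Number of individual label assignments per category
    # (each item contributes two assignments).
    assignments = {label: 0 for label in labels}
    for (a, b), n in pair_counts.items():
        assignments[a] += n
        assignments[b] += n

    # Expected agreement: sum of squared category proportions.
    p_expected = sum((m / (2 * n_items)) ** 2 for m in assignments.values())

    return (p_observed - p_expected) / (1 - p_expected)


kappa = fleiss_kappa_two_raters(PAIR_COUNTS, LABELS)
print(f"Fleiss' kappa: {kappa:.2f}")
```

Under these assumptions the result rounds to the
value of 0.45 reported above.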


Labelers seem to have a very good understanding
of what distinguishes the classes positive and
negative: only 106 tweets have been assigned both
these labels. The disagreement for positive/spam
and negative/spam is of similar or even smaller
magnitude. Looking at these tweets we noticed
that the disagreement is mainly related to
misunderstanding of the labeling instructions or probably
accidentally clicking the wrong label. Hence, these
tweets should be omitted from the test set when
evaluating methods for reliable Twitter Sentiment
Analysis.

However, the disagreement between
positive/negative and uncertain is relatively large.
These tweets make up about 76% of the tweets
to which the two labelers assigned different labels.
This indicates that in many cases not even two
humans can agree upon whether a tweet contains a
distinct sentiment or should be labeled uncertain.
Systems aiming to perform reliable sentiment
analysis of the public Twitter stream should be able to
deal with these tweets. While not strictly
belonging to the category uncertain they should still be
labeled as such or at least not be considered for
sentiment analysis. Another possible approach can
be to interpret them as rather positive or rather
negative, depending on the amount of reliability
the respective application requires.
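
The share of about 76% follows from the off-diagonal
cells of Table 1. A short sketch in Python, with
the counts copied from the table and illustrative
variable names:

```python
# Off-diagonal cells of Table 1: tweets whose two labels differ.
disagreements = {
    ("positive", "negative"): 106,
    ("positive", "uncertain"): 1666,
    ("positive", "spam"): 143,
    ("negative", "uncertain"): 2263,
    ("negative", "spam"): 58,
    ("uncertain", "spam"): 914,
}

total_disagreed = sum(disagreements.values())

# Pairs where one labeler chose a sentiment and the other chose uncertain.
sentiment_vs_uncertain = sum(
    n
    for pair, n in disagreements.items()
    if "uncertain" in pair and ("positive" in pair or "negative" in pair)
)

share = 100.0 * sentiment_vs_uncertain / total_disagreed
print(f"{sentiment_vs_uncertain} of {total_disagreed} disagreements ({share:.0f}%)")
```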

Moderate disagreement (914 tweets) can be 
noted for the classes uncertain and spam. Since 
these tweets may still contain useful information 
in the sense of answering the question “What do 
people talk about?” they probably should not be 
considered spam. However, they also should not 
be assigned a sentiment. A system labeling these 
as uncertain will still produce reliable results with 
regard to sentiment analysis. 
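
The handling suggested for the different kinds of
disagreement can be summarised as a small rule set.
The following Python sketch is one possible reading
of these recommendations, not a procedure shipped
with the dataset; the function `resolve_pair` and
its `lenient` flag are hypothetical names.

```python
def resolve_pair(label_a, label_b, lenient=False):
    """Map the two human labels of a tweet to one evaluation label.

    Returns None when the tweet should be left out of the test set.
    With lenient=True, a sentiment/uncertain disagreement keeps the
    sentiment label instead of falling back to "uncertain".
    """
    pair = {label_a, label_b}

    if len(pair) == 1:
        return label_a  # both labelers agree

    if pair == {"uncertain", "spam"}:
        # May still carry topical information, but no sentiment.
        return "uncertain"

    if "uncertain" in pair:
        # One labeler saw a sentiment, the other did not.
        (sentiment,) = pair - {"uncertain"}
        return sentiment if lenient else "uncertain"

    # positive/negative, positive/spam, negative/spam:
    # most likely labeling mistakes, so the tweet is omitted.
    return None


print(resolve_pair("positive", "uncertain"))                # strict reading
print(resolve_pair("positive", "uncertain", lenient=True))  # lenient reading
print(resolve_pair("positive", "negative"))                 # omitted
```

Which reading is appropriate depends on how much
reliability the application requires; the strict
default keeps the sentiment classes free of tweets
that the labelers disagreed on.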

As a first approach one can make use of just the
tweets with two identical labels to assess methods
for reliable sentiment analysis of the public
Twitter stream. However, it should be considered that
in practice the tweets upon which the labelers
disagreed can also appear in the stream and have to
be handled to provide reliable sentiment results.
To enable researchers to develop systems which
meet all the aforementioned requirements the
complete dataset including the tweets disagreed upon
is publicly available.


4 Conclusion and Outlook 


When performing analysis on the public live stream
of Twitter with regard to sentiment, it needs to be
considered that more than half of the tweets
cannot be assigned a distinct sentiment. These tweets
have to be filtered or explicitly dealt with before
sentiment analysis takes place. Moreover, one has
to deal with spam tweets. Spam adds unwanted
noise by polluting topics with artificially injected
tweets. Most of the work on spam detection on
Twitter focusses on catching the users generating
the spam by looking at the accounts’ behaviour
over time (Grier et al., 2010; Lin and Huang, 2013).
When performing real-time analysis, a given tweet
has to be determined to be spam or no spam by
looking at its content and meta data only, as there is
no time to examine the author’s account in detail.
New methods have to be developed which are able
to deal with sentiment uncertainty and spam if
reliable representations of the public opinion are to
be acquired from the Twitter stream. The dataset
presented in this paper can be used to develop and
evaluate methods for reliable Twitter Sentiment
Analysis.